
Generate (Buffered)

POST /generate

The /generate endpoint is used to communicate with the LLM. Use this endpoint when you want to receive the full response from the LLM all at once. If you want the response streamed token by token, see the /generate_stream endpoint instead.

To send a batch of requests at once, the text field can be either a string or an array of strings. The server also supports dynamic batching, in which requests arriving within a short time interval are processed together as a single batch.
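
For example, a single request and a batched request might look like the sketch below, using Python's requests library. The base URL is an assumption; point it at wherever your server is actually running.

```python
import requests

# Assumed base URL; replace with your server's address and port.
BASE_URL = "http://localhost:3000"

# Single request: `text` is a plain string.
single = requests.post(
    f"{BASE_URL}/generate",
    json={"text": "What is the capital of France?"},
)
print(single.json())

# Batched request: `text` is an array of strings, processed as one batch.
batch = requests.post(
    f"{BASE_URL}/generate",
    json={"text": [
        "What is the capital of France?",
        "What is the capital of Japan?",
    ]},
)
print(batch.json())
```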

Request

Body (required)

    constrained_decoding_backend (string, nullable)
    consumer_group (string, nullable)
    json_schema (nullable)
    max_new_tokens (int64, nullable)
    min_new_tokens (int64, nullable)
    no_repeat_ngram_size (int64, nullable)
    prompt_max_tokens (int64, nullable)
    regex_string (string, nullable)
    repetition_penalty (float, nullable)
    sampling_temperature (float, nullable)
    sampling_topk (int64, nullable)
    sampling_topp (float, nullable)

    text (object, required)

    The input text. Provided as a convenience so that users do not have to construct the clunkier PayloadText directly; the mapping from InputText to PayloadText is given below.

    oneOf: string, array of strings
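
Putting the fields together, a request body might look like the sketch below. The parameter values are illustrative assumptions only; every field other than text is nullable and can simply be omitted.

```python
payload = {
    "text": "Write a haiku about the sea.",
    "max_new_tokens": 64,         # upper bound on generated tokens
    "sampling_temperature": 0.7,  # higher values sample more randomly
    "sampling_topp": 0.9,         # nucleus (top-p) sampling threshold
    "sampling_topk": 50,          # restrict sampling to the top-k tokens
    "repetition_penalty": 1.1,    # penalize repeated tokens
}
```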

Responses

Takes in a JSON payload and returns the full response all at once.

Schema

    text (object, required)

    The generated text produced by the model.

    oneOf: string
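
A minimal sketch of reading the buffered response follows; the server address is an assumption, and the model's exact output will of course vary.

```python
import requests

# Assumed server address; adjust to your deployment.
resp = requests.post(
    "http://localhost:3000/generate",
    json={"text": "What is the capital of France?", "max_new_tokens": 32},
)
resp.raise_for_status()

# The whole generation arrives at once as a single JSON object with a
# `text` field, rather than as a token-by-token stream.
result = resp.json()
print(result["text"])
```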
